NaN problem of nn.MultiheadAttention in PyTorch


The nn.MultiheadAttention NaN problem

Over the past few days I have been tracking down the NaN that appears during the computation of PyTorch's nn.MultiheadAttention (see "nn.MultiheadAttention causes gradients to become NaN under some use cases", Issue #41508, pytorch/pytorch on GitHub). The root cause is that the tokenizer adds padding tokens on the left (they can only be added on the left; adding them on the right is wrong, because an autoregressive LLM cannot continue generating after padding tokens), so once the causal mask and the padding mask are merged, the first few rows of the attention matrix end up masked in their entirety. PyTorch fills masked positions with float("-inf"), so after softmax those rows become all NaN.
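A minimal repro of the failure mode (shapes, seed, and sizes are arbitrary, chosen only for illustration): with left padding plus a causal mask, the leading query rows can only attend to padding keys, every entry in those rows becomes float("-inf"), and softmax turns the whole row into NaN.

```python
import torch
import torch.nn as nn

# softmax over a row that is entirely -inf is NaN
print(torch.softmax(torch.full((4,), float("-inf")), dim=-1))  # tensor([nan, nan, nan, nan])

torch.manual_seed(0)
mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)

x = torch.randn(1, 4, 8)                                                   # one left-padded sequence of length 4
causal_mask = torch.triu(torch.ones(4, 4, dtype=torch.bool), diagonal=1)   # True = not allowed to attend
key_padding_mask = torch.tensor([[True, True, False, False]])              # first 2 positions are padding

out, _ = mha(x, x, x, attn_mask=causal_mask, key_padding_mask=key_padding_mask)
print(out)  # on affected PyTorch versions the first two query positions come out as NaN (fully masked rows)
```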

Solutions

  1. Have the tokenizer add the padding tokens on the right instead: not feasible, for the reason above.
  2. PyTorch supports two kinds of mask, float masks and bool masks: bool masks are automatically filled with float("-inf"), while float masks are left untouched. So you can build the mask yourself and fill the positions that need masking with a very small finite value such as -1e8 (see the sketch after this list).
  3. Leave the padding tokens alone in the forward pass and only ignore them when computing the loss, i.e. loss_fn = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id).
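A rough sketch of option 2 (dimensions and names are illustrative only): merge the causal and padding masks yourself into a single float mask, filling masked positions with -1e8 instead of float("-inf"), so that even a fully masked row still yields finite softmax outputs.

```python
import torch
import torch.nn as nn

seq_len, pad_len = 4, 2
causal = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
padding = torch.zeros(seq_len, dtype=torch.bool)
padding[:pad_len] = True                       # left padding: first pad_len positions are pad tokens

# merge causal + padding into one float mask; masked positions get -1e8, not -inf
combined = causal | padding.unsqueeze(0)       # (seq_len, seq_len), True = masked
float_mask = torch.zeros(seq_len, seq_len).masked_fill(combined, -1e8)

mha = nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True)
x = torch.randn(1, seq_len, 8)
out, _ = mha(x, x, x, attn_mask=float_mask)    # fully masked rows now give a uniform softmax, not NaN
print(out.isnan().any())                       # tensor(False)
```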

Other observations

  1. Qwen's tokenizer has tokenizer.padding_side="right", so when doing batched completion the shorter sentences, i.e. the ones that get padding tokens appended, produce very poor completions, as shown below.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_pth = "qwen/Qwen2-1.5B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_pth, torch_dtype=torch.bfloat16, device_map="cuda")
tokenizer = AutoTokenizer.from_pretrained(model_pth)

# inference
in_text = ["Beijing is the captical of ", "Apple"]
in_tokens = tokenizer(in_text, return_tensors="pt", padding=True)["input_ids"]
print("bos:", tokenizer.bos_token_id, tokenizer.bos_token)  # bos: None None
print("eos:", tokenizer.eos_token_id, tokenizer.eos_token)  # eos: 151645 <|im_end|>
print("pad:", tokenizer.pad_token_id, tokenizer.pad_token)  # pad: 151643 <|endoftext|>
print("in_tokens:")
for x in in_tokens:
    print(x)
# tensor([ 3430, 23649,   374,   279,  6427,   938,   315,   220])
# tensor([ 26567, 151643, 151643, 151643, 151643, 151643, 151643, 151643])

outputs = model.generate(in_tokens.to("cuda"))
for x in outputs:
    print(x)
# tensor([ 3430, 23649,   374,   279,  6427,   938,   315,   220, 30743,
#            16,  2130,   323,   220,    17,    15,   339, 23631,    13,
#           576,  3283], device='cuda:0')
# tensor([ 26567, 151643, 151643, 151643, 151643, 151643, 151643, 151643,      2,
#             16,     17,     18,    271,  22599,  14363,   1965,   7591,     62,
#             21,     19], device='cuda:0')

output_text = tokenizer.batch_decode(outputs)
for x in output_text:
    print(x)
# Beijing is the captical of __1__ Beijing. The city has a population of over
# Apple<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>## 50. If (a) is a positive
```
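One common workaround (a sketch continuing the snippet above, not re-run here): switch the tokenizer to left padding and pass the attention mask to generate(), so the model is never asked to continue from a padding token.

```python
# continuation of the Qwen snippet above: model, tokenizer and in_text are already defined
tokenizer.padding_side = "left"
enc = tokenizer(in_text, return_tensors="pt", padding=True).to("cuda")
outputs = model.generate(input_ids=enc["input_ids"], attention_mask=enc["attention_mask"])
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```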
  2. DeepSeek's model appears to use the Llama architecture; it is unclear whether the model structure was changed at all.
```python
import torch
from transformers import AutoModelForCausalLM

model_pth = "deepseek-ai/deepseek-coder-1.3b-instruct"
model = AutoModelForCausalLM.from_pretrained(model_pth, torch_dtype=torch.bfloat16, device_map="cuda")
print(type(model))  # <class 'transformers.models.llama.modeling_llama.LlamaForCausalLM'>
```